Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import types as T

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 4 - More Comfortable with SQL?")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path

df = spark.read.csv(path, header=True)
df.toPandas()
   id species_id   name             birthday  color
0   1          1   King  2014-11-22 12:30:31  brown
1   2          3  Argus  2016-11-22 10:05:10   None

Register DataFrame as a SQL Table

df.createOrReplaceTempView("pets")

What Happened?

The first step in making a df queryable with SQL is to register it as a SQL temp view.

This particular function will replace any previously registered local temp view named pets. There are other functions that register a DataFrame with slightly different behavior. You can check the reference docs if this isn't the desired behavior: docs
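For example, here is a quick sketch of two related registration functions from the DataFrame API (the view names pets_strict and pets_global are just illustrative): createTempView raises an error if the name is already taken instead of replacing it, and createGlobalTempView shares the view across SparkSessions.

# Raises an AnalysisException if a view named "pets_strict"
# already exists, instead of silently replacing it.
df.createTempView("pets_strict")

# Global temp views are shared across SparkSessions and live
# in the reserved "global_temp" database.
df.createGlobalTempView("pets_global")
spark.sql("SELECT * FROM global_temp.pets_global").show()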

Let's Write a SQL Query!

df_2 = spark.sql("""
SELECT 
    *
FROM pets
WHERE name = 'Argus'
""")

df_2.toPandas()
   id species_id   name             birthday color
0   2          3  Argus  2016-11-22 10:05:10  None

What Happened?

Once your df is registered, call the sql function on your spark session object. It takes a SQL string as input and outputs a new df.
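Because the output is an ordinary df, you can keep transforming it with the DataFrame API afterwards. A minimal sketch (the df_3 name and the aggregation query are just for illustration):

df_3 = spark.sql("""
SELECT
    species_id,
    count(*) AS num_pets
FROM pets
GROUP BY species_id
""")

# The SQL result is a regular df, so DataFrame methods still apply.
df_3.filter(df_3.num_pets > 0).show()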

Conclusion?

If you're more comfortable with writing SQL than Python/Spark code, then you can do so with a Spark df! The steps are:

  1. Register the df with df.createOrReplaceTempView('table').
  2. Call the sql function on your spark session with a SQL string as input.
  3. You're done! (A complete end-to-end sketch follows below.)
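Putting it together, a minimal end-to-end sketch reusing the path and view name from above:

df = spark.read.csv(path, header=True)    # load the data
df.createOrReplaceTempView("pets")        # step 1: register the view
df_2 = spark.sql("SELECT * FROM pets WHERE name = 'Argus'")   # step 2: query it
df_2.toPandas()                           # inspect the result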
